Concrete Compressive Strength Prediction¶


Author: Irfan Ullah Khan
Date: 25/06/25
Description: This notebook explores the prediction of concrete compressive strength based on its composition and age using machine learning techniques.

Concrete Strength Prediction.jpg

In [ ]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
import warnings
warnings.filterwarnings('ignore')
In [ ]:
# Set style for plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

1. Data Loading and Initial Exploration¶

In [ ]:
# Load the dataset
df=pd.read_csv('concrete.csv')
In [ ]:
print("Dataset shape:", df.shape)
print("\nFirst 5 rows:")
display(df.head())
Dataset shape: (1030, 9)

First 5 rows:
cement slag ash water superplastic coarseagg fineagg age strength
0 141.3 212.0 0.0 203.5 0.0 971.8 748.5 28 29.89
1 168.9 42.2 124.3 158.3 10.8 1080.8 796.2 14 23.51
2 250.0 0.0 95.7 187.4 5.5 956.9 861.2 28 29.22
3 266.0 114.0 0.0 228.0 0.0 932.0 670.0 28 45.85
4 154.8 183.4 0.0 193.3 9.1 1047.4 696.7 28 18.29
In [ ]:
# Basic statistics
print("\nDescriptive statistics:")
display(df.describe().T)
Descriptive statistics:
count mean std min 25% 50% 75% max
cement 1030.0 281.167864 104.506364 102.00 192.375 272.900 350.000 540.0
slag 1030.0 73.895825 86.279342 0.00 0.000 22.000 142.950 359.4
ash 1030.0 54.188350 63.997004 0.00 0.000 0.000 118.300 200.1
water 1030.0 181.567282 21.354219 121.80 164.900 185.000 192.000 247.0
superplastic 1030.0 6.204660 5.973841 0.00 0.000 6.400 10.200 32.2
coarseagg 1030.0 972.918932 77.753954 801.00 932.000 968.000 1029.400 1145.0
fineagg 1030.0 773.580485 80.175980 594.00 730.950 779.500 824.000 992.6
age 1030.0 45.662136 63.169912 1.00 7.000 28.000 56.000 365.0
strength 1030.0 35.817961 16.705742 2.33 23.710 34.445 46.135 82.6
In [ ]:
# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())
Missing values:
cement          0
slag            0
ash             0
water           0
superplastic    0
coarseagg       0
fineagg         0
age             0
strength        0
dtype: int64
In [ ]:
# Check data types
print("\nData types:")
print(df.dtypes)
Data types:
cement          float64
slag            float64
ash             float64
water           float64
superplastic    float64
coarseagg       float64
fineagg         float64
age               int64
strength        float64
dtype: object

2. Exploratory Data Analysis (EDA)¶

In [ ]:
# Histograms of all features
df.hist(figsize=(15, 12), bins=20)
plt.tight_layout()
plt.show()
No description has been provided for this image
In [ ]:
# Boxplots for all features
plt.figure(figsize=(15, 8))
df.boxplot()
plt.xticks(rotation=45)
plt.title("Boxplots of Features")
plt.show()
No description has been provided for this image
In [ ]:
# Correlation matrix
plt.figure(figsize=(12, 10))
corr = df.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.title("Correlation Matrix")
plt.show()
No description has been provided for this image
In [ ]:
# Pairplot of selected features
sns.pairplot(df[['cement', 'water', 'age', 'superplastic', 'strength']])
plt.show()
No description has been provided for this image
In [ ]:
# Strength distribution
plt.figure(figsize=(10, 6))
sns.histplot(df['strength'], kde=True)
plt.title("Distribution of Concrete Compressive Strength")
plt.xlabel("Strength (MPa)")
plt.show()
No description has been provided for this image
In [ ]:
# Relationship between age and strength
plt.figure(figsize=(12, 6))
sns.scatterplot(x='age', y='strength', data=df)
plt.title("Strength vs Age")
plt.show()
No description has been provided for this image
In [ ]:
# Relationship between cement and strength
plt.figure(figsize=(12, 6))
sns.scatterplot(x='cement', y='strength', data=df)
plt.title("Strength vs Cement Content")
plt.show()
No description has been provided for this image
In [ ]:
# Relationship between water and strength
plt.figure(figsize=(12, 6))
sns.scatterplot(x='water', y='strength', data=df)
plt.title("Strength vs Water Content")
plt.show()
No description has been provided for this image

3. Feature Engineering¶

In [ ]:
# Create new features
df['water_cement_ratio'] = df['water'] / df['cement']
df['aggregate_ratio'] = df['fineagg'] / df['coarseagg']
df['total_binder'] = df['cement'] + df['slag'] + df['ash']
df['total_aggregate'] = df['coarseagg'] + df['fineagg']
In [ ]:
# Log transform age to handle skewness
df['log_age'] = np.log1p(df['age'])
In [ ]:
# Check new features
print(df[['water_cement_ratio', 'aggregate_ratio', 'total_binder', 'total_aggregate', 'log_age']].describe())
       water_cement_ratio  aggregate_ratio  total_binder  total_aggregate  \
count         1030.000000      1030.000000   1030.000000      1030.000000   
mean             0.748266         0.801453    409.252039      1746.499417   
std              0.314005         0.115049     92.780669       101.235195   
min              0.266893         0.533369    200.000000      1457.000000   
25%              0.533333         0.736298    336.425000      1679.000000   
50%              0.675349         0.789855    391.300000      1752.900000   
75%              0.935165         0.891673    483.700000      1827.950000   
max              1.882353         1.164887    640.000000      1970.000000   

           log_age  
count  1030.000000  
mean      3.242209  
std       1.110431  
min       0.693147  
25%       2.079442  
50%       3.367296  
75%       4.043051  
max       5.902633  
In [ ]:
# Visualize new features
plt.figure(figsize=(15, 10))
plt.subplot(2, 2, 1)
sns.scatterplot(x='water_cement_ratio', y='strength', data=df)
plt.title("Strength vs Water-Cement Ratio")

plt.subplot(2, 2, 2)
sns.scatterplot(x='aggregate_ratio', y='strength', data=df)
plt.title("Strength vs Aggregate Ratio")

plt.subplot(2, 2, 3)
sns.scatterplot(x='total_binder', y='strength', data=df)
plt.title("Strength vs Total Binder Content")

plt.subplot(2, 2, 4)
sns.scatterplot(x='log_age', y='strength', data=df)
plt.title("Strength vs Log(Age)")

plt.tight_layout()
plt.show()
No description has been provided for this image

4. Data Preprocessing¶

In [ ]:
# Define features and target
X = df.drop('strength', axis=1)
y = df['strength']
In [ ]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In [ ]:
# Create preprocessing pipeline
preprocessor = Pipeline([
    ('scaler', StandardScaler()),
    ('transformer', PowerTransformer(method='yeo-johnson')),
    ('selector', SelectKBest(score_func=f_regression, k=10))
])
In [ ]:
# Fit and transform the training data
X_train_preprocessed = preprocessor.fit_transform(X_train, y_train)
X_test_preprocessed = preprocessor.transform(X_test)
In [ ]:
# Get selected features
selected_features = preprocessor.named_steps['selector'].get_support()
print("Selected features:", X.columns[selected_features])
Selected features: Index(['cement', 'slag', 'water', 'superplastic', 'coarseagg', 'age',
       'water_cement_ratio', 'total_binder', 'total_aggregate', 'log_age'],
      dtype='object')

5. Model Building and Evaluation¶

In [ ]:
# Define models to evaluate
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
    'Lasso Regression': Lasso(),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
    'XGBoost': XGBRegressor(random_state=42),
    'LightGBM': LGBMRegressor(random_state=42)
}
In [ ]:
# Evaluate each model
results = {}
for name, model in models.items():
    # Fit the model
    model.fit(X_train_preprocessed, y_train)
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000249 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1300
[LightGBM] [Info] Number of data points in the train set: 824, number of used features: 10
[LightGBM] [Info] Start training from score 35.765570
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
In [ ]:
# Make predictions
y_pred = model.predict(X_test_preprocessed)
In [ ]:
# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
In [ ]:
# Store results
results[name] = {
'RMSE': rmse,
'R2 Score': r2
}
In [ ]:
# Print results
print(f"{name}:")
print(f"  RMSE: {rmse:.4f}")
print(f"  R2 Score: {r2:.4f}")
print()
LightGBM:
  RMSE: 4.7674
  R2 Score: 0.9206

In [ ]:
# Convert results to DataFrame
results_df = pd.DataFrame(results).T
results_df = results_df.sort_values('RMSE')
display(results_df)
RMSE R2 Score
LightGBM 4.767359 0.920571

6. Hyperparameter Tuning for Best Model¶

In [ ]:
# Select best model based on initial results
best_model_name = results_df.index[0]
print(f"Best model from initial evaluation: {best_model_name}")
Best model from initial evaluation: LightGBM
In [ ]:
# Define parameter grid for GridSearchCV
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
    }
In [ ]:
# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)
In [ ]:
# Fit the grid search
grid_search.fit(X_train_preprocessed, y_train)
Fitting 5 folds for each of 243 candidates, totalling 1215 fits
Out[ ]:
GridSearchCV(cv=5, estimator=GradientBoostingRegressor(random_state=42),
             n_jobs=-1,
             param_grid={'learning_rate': [0.01, 0.05, 0.1],
                         'max_depth': [3, 5, 7], 'min_samples_leaf': [1, 2, 4],
                         'min_samples_split': [2, 5, 10],
                         'n_estimators': [100, 200, 300]},
             scoring='neg_mean_squared_error', verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=5, estimator=GradientBoostingRegressor(random_state=42),
             n_jobs=-1,
             param_grid={'learning_rate': [0.01, 0.05, 0.1],
                         'max_depth': [3, 5, 7], 'min_samples_leaf': [1, 2, 4],
                         'min_samples_split': [2, 5, 10],
                         'n_estimators': [100, 200, 300]},
             scoring='neg_mean_squared_error', verbose=1)
GradientBoostingRegressor(max_depth=5, min_samples_leaf=2, min_samples_split=10,
                          n_estimators=300, random_state=42)
GradientBoostingRegressor(max_depth=5, min_samples_leaf=2, min_samples_split=10,
                          n_estimators=300, random_state=42)
In [ ]:
# Get best estimator and parameters
best_model = grid_search.best_estimator_
print("\nBest parameters:", grid_search.best_params_)
Best parameters: {'learning_rate': 0.1, 'max_depth': 5, 'min_samples_leaf': 2, 'min_samples_split': 10, 'n_estimators': 300}
In [ ]:
# Evaluate best model
y_pred_best = best_model.predict(X_test_preprocessed)
rmse_best = np.sqrt(mean_squared_error(y_test, y_pred_best))
r2_best = r2_score(y_test, y_pred_best)

print(f"\nBest model performance:")
print(f"  RMSE: {rmse_best:.4f}")
print(f"  R2 Score: {r2_best:.4f}")
Best model performance:
  RMSE: 5.0804
  R2 Score: 0.9098

7. Feature Importance Analysis¶

In [ ]:
# Get feature importance from the best model
feature_importance = best_model.feature_importances_
In [ ]:
# Create a DataFrame for visualization
features_df = pd.DataFrame({
    'Feature': X.columns[selected_features],
    'Importance': feature_importance
}).sort_values('Importance', ascending=False)
In [ ]:
# Plot feature importance
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=features_df)
plt.title("Feature Importance from Gradient Boosting Model")
plt.tight_layout()
plt.show()
No description has been provided for this image

8. Residual Analysis¶

In [ ]:
# Calculate residuals
residuals = y_test - y_pred_best
In [ ]:
# Plot residuals
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.scatterplot(x=y_pred_best, y=residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residuals vs Predicted Values")

plt.subplot(1, 2, 2)
sns.histplot(residuals, kde=True)
plt.title("Distribution of Residuals")

plt.tight_layout()
plt.show()
No description has been provided for this image

9. Final Model Evaluation¶

In [ ]:
# Cross-validation scores
cv_scores = cross_val_score(
    best_model,
    preprocessor.transform(X),
    y,
    cv=5,
    scoring='neg_mean_squared_error'
)
In [ ]:
# Calculate RMSE from cross-validation
cv_rmse = np.sqrt(-cv_scores.mean())
print(f"Cross-validated RMSE: {cv_rmse:.4f}")
Cross-validated RMSE: 4.1944
In [ ]:
# Actual vs Predicted plot
plt.figure(figsize=(8, 8))
plt.scatter(y_test, y_pred_best, alpha=0.5)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--')
plt.xlabel("Actual Strength")
plt.ylabel("Predicted Strength")
plt.title("Actual vs Predicted Strength")
plt.show()
No description has been provided for this image

10. Model Interpretation with SHAP Values¶

In [ ]:
# Install shap if not available
try:
    import shap
except:
    !pip install shap
    import shap
In [ ]:
# Initialize SHAP explainer
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test_preprocessed)
In [ ]:
# Summary plot
plt.figure()
shap.summary_plot(shap_values, X_test_preprocessed, feature_names=X.columns[selected_features])
plt.show()
No description has been provided for this image

11. Conclusion and Recommendations¶

Key Findings:

  1. The most important features for predicting concrete strength are:

    • Cement content
    • Water-cement ratio
    • Age of the concrete
    • Superplasticizer content
  2. The Gradient Boosting model performed best with an RMSE of [value] and R2 score of [value].

Recommendations:

  1. To increase concrete strength:

    • Increase cement content (within practical limits)
    • Reduce water-cement ratio (while maintaining workability)
    • Use appropriate superplasticizers
    • Allow adequate curing time
  2. For future improvements:

    • Collect more data with varied mixtures
    • Experiment with additional features like curing conditions
    • Consider ensemble methods for improved accuracy """

Github Repo : https://github.com/programmarself/concrete_strength_predictor

Kaggle Notebook: https://www.kaggle.com/code/programmarself/concrete-strength-prediction

Live app Demo : https://concretestrengthpredictors.streamlit.app/


👨💻 By: Irfan Ullah Khan

GitHub Kaggle LinkedIn

YouTube Email Website)